By default, Algolia queues all URLs that comply with the pathsToMatch and fileTypesToMatch parameters of your actions and with the exclusionPatterns parameter. To override this default logic, provide a linkExtractor function that returns its own list of URLs to queue.

Parameters

$
object

A Cheerio instance with the HTML of the crawled page. For more information, see Extracting data with Cheerio.

defaultExtractor
function

The Crawler’s default URL discovery function. It returns an array of strings, each representing a URL on the page that matches the crawler’s configuration.

url
URL

URL of the page that was just crawled.

Examples

JavaScript
{
    linkExtractor: ({ $, url, defaultExtractor }) => {
        if (/example\.com\/doc\//.test(url.href)) {
            // For all pages under /doc, only queue the first found link
            return defaultExtractor().slice(0, 1);
        }
        // Otherwise, use the default logic (queue all found links)
        return defaultExtractor();
    },
}
JavaScript
{
    linkExtractor: ({ $, url, defaultExtractor }) => {
        // This turns off link discovery, except for URLs listed in sitemap.xml
        return /sitemap\.xml/.test(url.href) ? defaultExtractor() : [];
    },
}
JavaScript
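{
    linkExtractor: ({ $ }) => {
        // Access the DOM and collect the href of every element with the my-link class
        // (elements without an href are skipped)
        return $('.my-link')
            .map((i, el) => $(el).attr('href'))
            .get();
    },
}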
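
The examples above return the output of defaultExtractor either whole or sliced. As a further illustration, here is a minimal sketch that post-processes that output, keeping only links on the same origin as the crawled page and dropping duplicates. The helper name sameOriginLinks is hypothetical and not part of the Crawler API. The sketch assumes only what the parameter list states: url is a URL object, and defaultExtractor returns an array of strings.

```javascript
// Hypothetical helper (not part of the Crawler API):
// keep only links on the same origin as the crawled page, without duplicates.
function sameOriginLinks(links, pageUrl) {
  const seen = new Set();
  const result = [];
  for (const link of links) {
    if (typeof link !== 'string') continue; // Ignore missing values
    let resolved;
    try {
      // Resolve relative links against the URL of the crawled page
      resolved = new URL(link, pageUrl);
    } catch {
      continue; // Skip values that can't be parsed as a URL
    }
    if (resolved.origin !== pageUrl.origin) continue;
    resolved.hash = ''; // Treat links that differ only by #fragment as one page
    if (!seen.has(resolved.href)) {
      seen.add(resolved.href);
      result.push(resolved.href);
    }
  }
  return result;
}

const config = {
  linkExtractor: ({ url, defaultExtractor }) =>
    sameOriginLinks(defaultExtractor(), url),
};
```

Keeping the filtering in a plain function means it can be checked with ordinary arrays, without running a crawl.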